Overview: This data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Key question: “What chemical properties are most important in terms of predicting the quality of wine?”
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Our data has 1,599 observations of 13 variables: a row index (X), 11 chemical properties of the wine, and the quality rating.
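The inspection above can be sketched as follows. The original code is not shown, so this is only an illustration: the name `rw_head` and the choice of columns are assumptions, and to stay self-contained it rebuilds just the first ten rows printed by str() rather than reading the full file.

```r
# First ten rows only, copied from the str() output above.
# The full data set has 1,599 rows; rw_head is a hypothetical name.
rw_head <- data.frame(
  X                = 1:10,
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(rw_head)      # number of rows and columns
str(rw_head)      # column names and types
summary(rw_head)  # min, quartiles, mean and max per column

# X is only a row index, so it is dropped before any analysis.
rw_chem <- subset(rw_head, select = -X)
names(rw_chem)
```

Dropping X early keeps it out of later correlation matrices and models, where it carries no chemical meaning.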
As the key question is which chemical properties matter most for predicting the quality of red wine, I would first like to look at the distribution plots for all 11 chemical properties to see if anything catches my eye.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
First, let’s take a look at the distribution of quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
All the red wine quality ratings range from 3 to 8. No wine was rated very bad (0-2) or outstanding (9-10). Moreover, most of the wines are rated 5 or 6. The mean of the quality ratings is 5.636 and the median is 6.000.
Now let’s take a closer look at the 11 chemical properties within the data set.
The histogram of fixed.acidity shows that fixed.acidity is slightly skewed to the right. Let’s see the log plot.
The log plot is much closer to the bell curve of a normal distribution. Let’s see the summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The mean for fixed.acidity is 8.32 and the median is 7.90. The middle 50% of the red wines have fixed.acidity between 7.10 and 9.20.
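One way to put a number on "slightly right-skewed, and less so after a log transform" is a quartile-based skewness measure computed from the summary above. This is a sketch, not the method used in the original analysis: `quartile_skew` is a hypothetical helper, and logging the reported quartiles only approximates the quartiles of the logged data.

```r
# Quartile (Bowley) skewness: 0 when the middle half is symmetric,
# positive when the upper quartile sits further from the median (right skew).
quartile_skew <- function(q1, med, q3) (q3 + q1 - 2 * med) / (q3 - q1)

# Quartiles of fixed.acidity as reported in the summary above.
q <- c(q1 = 7.10, med = 7.90, q3 = 9.20)
raw_skew <- quartile_skew(q[["q1"]], q[["med"]], q[["q3"]])

# Apply the same measure on the log10 scale.
lq <- log10(q)
log_skew <- quartile_skew(lq[["q1"]], lq[["med"]], lq[["q3"]])

# raw is about 0.24, log10 is about 0.18: still right-skewed, but less so.
round(c(raw = raw_skew, log10 = log_skew), 3)
```

The same three-number check can be repeated for any of the other variables using the quartiles from their summaries.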
The histogram of volatile.acidity shows that volatile.acidity is slightly right-skewed. Let’s see the summary of volatile.acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Most of the red wines have volatile.acidity from about 0.40 to 0.65.
The histogram of citric.acid shows that the most common value of citric.acid is 0. This may need further exploration. Let’s see the summary for citric.acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The smallest value for citric.acid is 0.000 and the largest value for citric.acid is 1.000. However, 75% of the values are below 0.420. Let’s see the square root distribution.
In the first plot, 0 is the value with the largest count, and it is still 0 after taking the square root. The remaining values lie between 0 and 1, so taking the square root makes them larger. As a result, there is a gap in the plot between 0 and the smallest nonzero values. I wonder how citric.acid is connected to quality, and whether the citric.acid values are specific to certain alcohol levels.
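The gap can be reproduced on a handful of values. The sketch below uses only the first ten citric.acid values printed by str() at the top of the report, so it illustrates the effect rather than the full distribution; the variable names are assumptions.

```r
# First ten citric.acid values, copied from the str() output.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
sqrt_citric <- sqrt(citric)

# Zeros stay at zero; values strictly between 0 and 1 move up.
data.frame(citric, sqrt_citric = round(sqrt_citric, 3))

# The smallest nonzero value jumps away from zero after the transform,
# which leaves an empty stretch next to the spike at 0 in the histogram.
smallest_nonzero <- min(citric[citric > 0])
c(before = smallest_nonzero, after = sqrt(smallest_nonzero))
```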
From the histogram of residual.sugar, we can see that it’s very long-tailed. Let’s see the log plot.
We can see from the log plot that the distribution of residual.sugar is closer to a normal distribution. Let’s see the summary of residual.sugar:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The middle 50% of red wines have residual.sugar values between 1.9 and 2.6.
From the histogram of chlorides, we can see that it’s very long-tailed. Let’s see the log plot.
The log plot is much closer to the bell curve of a normal distribution. Let’s see the summary for chlorides:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Half of the red wines have chlorides between 0.07 and 0.09, and 75% of the red wines have chlorides below 0.09.
The histogram of free.sulfur.dioxide is right-skewed. Let’s see the log plot.
Let’s see the summary for free.sulfur.dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The largest value of free.sulfur.dioxide in a red wine is 72.00, which is far above the 3rd quartile (21.00) and the median (14.00).
The histogram of total.sulfur.dioxide is long-tailed. Let’s see the log plot.
The log plot for total.sulfur.dioxide is close to the bell curve. Let’s see the summary of total.sulfur.dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
For the distribution of total.sulfur.dioxide, the mean is 46.47 and the median is 38.00. 75% of the total.sulfur.dioxide values are below 62.00, while the maximum is 289.00.
The distribution of density shown in this histogram is close to a normal distribution. Let’s see the summary of density:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Combining the histogram and the summary, we can see that density is close to normally distributed, ranging from 0.9901 to 1.0037. The median (0.9968) and mean (0.9967) are almost the same!
The histogram of pH is also close to a normal distribution. Let’s see the summary of pH:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The maximum pH is 4.010, which indicates that all the red wines are acidic (pH < 7). Combining the histogram and the summary, we can see that pH is close to normally distributed, ranging from 2.740 to 4.010. The median (3.310) and mean (3.311) are almost the same!
It seems that the distribution for sulphates is right-skewed. Let’s see the log plot.
The log plot for sulphates is closer to the bell curve. Let’s see the summary of sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Most of the red wines have sulphates between 0.55 and 0.73, while the maximum is 2.00.
The distribution for the alcohol shown in this histogram is right-skewed. Let’s see the log plot.
Let’s see the summary for alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol values for the red wines range from 8.40 to 14.90. I wonder if alcohol affects the quality.
There are 1,599 red wines in the data with 11 chemical properties (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol).
The main features in the data set are quality and alcohol. I’d like to determine which features are best for predicting the quality of red wine. I suspect volatile.acidity and some combination of the other variables can be used to build a predictive model of red wine quality.
volatile.acidity, residual.sugar, chlorides, density, sulphates and alcohol likely contribute to the quality of red wine. After reading up on red wine quality, I think volatile.acidity and alcohol probably contribute the most.
Not at the moment.
I log-transformed the right-skewed distributions (fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol).
## X fixed.acidity volatile.acidity citric.acid
## X 1.000 -0.268 -0.009 -0.154
## fixed.acidity -0.268 1.000 -0.256 0.672
## volatile.acidity -0.009 -0.256 1.000 -0.552
## citric.acid -0.154 0.672 -0.552 1.000
## residual.sugar -0.031 0.115 0.002 0.144
## chlorides -0.120 0.094 0.061 0.204
## free.sulfur.dioxide 0.090 -0.154 -0.011 -0.061
## total.sulfur.dioxide -0.118 -0.113 0.076 0.036
## density -0.368 0.668 0.022 0.365
## pH 0.136 -0.683 0.235 -0.542
## sulphates -0.125 0.183 -0.261 0.313
## alcohol 0.245 -0.062 -0.202 0.110
## quality 0.066 0.124 -0.391 0.226
## residual.sugar chlorides free.sulfur.dioxide
## X -0.031 -0.120 0.090
## fixed.acidity 0.115 0.094 -0.154
## volatile.acidity 0.002 0.061 -0.011
## citric.acid 0.144 0.204 -0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 -0.022
## pH -0.086 -0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 -0.221 -0.069
## quality 0.014 -0.129 -0.051
## total.sulfur.dioxide density pH sulphates alcohol
## X -0.118 -0.368 0.136 -0.125 0.245
## fixed.acidity -0.113 0.668 -0.683 0.183 -0.062
## volatile.acidity 0.076 0.022 0.235 -0.261 -0.202
## citric.acid 0.036 0.365 -0.542 0.313 0.110
## residual.sugar 0.203 0.355 -0.086 0.006 0.042
## chlorides 0.047 0.201 -0.265 0.371 -0.221
## free.sulfur.dioxide 0.668 -0.022 0.070 0.052 -0.069
## total.sulfur.dioxide 1.000 0.071 -0.066 0.043 -0.206
## density 0.071 1.000 -0.342 0.149 -0.496
## pH -0.066 -0.342 1.000 -0.197 0.206
## sulphates 0.043 0.149 -0.197 1.000 0.094
## alcohol -0.206 -0.496 0.206 0.094 1.000
## quality -0.185 -0.175 -0.058 0.251 0.476
## quality
## X 0.066
## fixed.acidity 0.124
## volatile.acidity -0.391
## citric.acid 0.226
## residual.sugar 0.014
## chlorides -0.129
## free.sulfur.dioxide -0.051
## total.sulfur.dioxide -0.185
## density -0.175
## pH -0.058
## sulphates 0.251
## alcohol 0.476
## quality 1.000
From a subset of the data, sulphates, citric.acid and total.sulfur.dioxide do not seem to be strongly correlated with quality, but density is moderately correlated with alcohol (r = -0.496). I want to look closer at scatter plots involving quality and some other variables like alcohol, volatile.acidity, and density.
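To see at a glance which properties track quality most closely, the quality column of the correlation matrix above can be ranked by absolute value. This sketch retypes the reported coefficients instead of recomputing them, and the vector names `r_quality` and `ranked` are assumptions.

```r
# Correlations with quality, copied from the matrix above (X omitted: it is a row index).
r_quality <- c(
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  density              = -0.175,
  pH                   = -0.058,
  sulphates            =  0.251,
  alcohol              =  0.476
)

# Rank by strength of association, ignoring the sign.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
ranked
```

Alcohol, volatile.acidity and sulphates come out on top, and no coefficient reaches 0.5 in absolute value.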
Let’s see the scatter plot for quality and alcohol:
The pattern is not clear from this plot. Let’s transform the data and add some jitter:
In this plot, the points shift slightly from the bottom-left corner to the top-right, which indicates that red wine quality is positively correlated with alcohol. Most red wines in the data have an alcohol percentage between 9 and 12.
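A jittered, semi-transparent scatter plot along these lines can be sketched in base R. The original plotting code is not shown, so the jitter amount, colour and alpha level are assumptions, and only the first ten rows from the str() output are used.

```r
set.seed(42)  # jitter is random; fixing the seed makes the plot reproducible

# First ten rows, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Shift each integer rating vertically by up to +/- 0.3 so stacked points separate.
quality_jit <- jitter(quality, amount = 0.3)

plot(alcohol, quality_jit,
     log  = "x",                                     # log-scaled x axis
     pch  = 16,
     col  = adjustcolor("steelblue", alpha.f = 0.4), # semi-transparent points
     xlab = "alcohol (log scale)",
     ylab = "quality (jittered)")
```

Jitter is applied to quality only, because it is the integer-valued axis that causes the stacking.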
##
## Call:
## lm(formula = quality ~ alcohol, data = subset(rw, alcohol <=
## quantile(rw$alcohol, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8489 -0.4065 -0.1787 0.5176 2.5909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.81782 0.17512 10.38 <2e-16 ***
## alcohol 0.36646 0.01672 21.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7083 on 1596 degrees of freedom
## Multiple R-squared: 0.2314, Adjusted R-squared: 0.2309
## F-statistic: 480.4 on 1 and 1596 DF, p-value: < 2.2e-16
From the summary, based on the R^2 value, alcohol explains 23.14% of the variance in red wine quality.
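For a model with a single predictor, R^2 is the square of the Pearson correlation between the predictor and the response. The sketch below checks this on the first ten rows from the str() output (an illustrative sample, not the full data), then squares the coefficient reported in the correlation matrix.

```r
# First ten rows, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

fit <- lm(quality ~ alcohol)

r2_model <- summary(fit)$r.squared   # R^2 from the fitted model
r2_corr  <- cor(alcohol, quality)^2  # squared Pearson correlation

all.equal(r2_model, r2_corr)         # the two agree up to rounding error

# Squared correlation of alcohol and quality from the matrix above: about 0.227.
0.476^2
```

The 0.2314 reported above is slightly higher than 0.227 because that model was fitted after dropping alcohol values above the 99.9th percentile; the full-data model m1 in the later table reports 0.227.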
Let’s see the scatter plot for quality and volatile.acidity:
The data suffers from overplotting, so let’s add some jitter and alpha and log-transform the data.
After applying jitter, alpha and a log transform, we can see the points move from top-left to bottom-right, which indicates that volatile.acidity is negatively correlated with red wine quality. Most red wines have volatile.acidity between 0.25 and 0.75. Red wines with higher volatile.acidity tend to have lower quality, while red wines with lower volatile.acidity tend to have better quality.
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = subset(rw, volatile.acidity <=
## quantile(rw$volatile.acidity, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.78977 -0.54547 -0.01325 0.47198 2.92568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.55757 0.05841 112.27 <2e-16 ***
## volatile.acidity -1.74500 0.10503 -16.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7436 on 1596 degrees of freedom
## Multiple R-squared: 0.1474, Adjusted R-squared: 0.1469
## F-statistic: 276 on 1 and 1596 DF, p-value: < 2.2e-16
From the summary, based on the R^2 value, volatile.acidity explains only 14.74% of the variance in red wine quality.
Finally, let’s see the scatter plot between quality and density:
The data suffers from overplotting, so let’s add some jitter and alpha and log-transform the data.
##
## Call:
## lm(formula = quality ~ density, data = subset(rw, density <=
## quantile(rw$density, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7918 -0.6200 0.1504 0.4262 2.5233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.43 10.60 7.779 1.31e-14 ***
## density -77.04 10.63 -7.247 6.62e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7952 on 1595 degrees of freedom
## Multiple R-squared: 0.03188, Adjusted R-squared: 0.03127
## F-statistic: 52.52 on 1 and 1595 DF, p-value: 6.616e-13
Based on the plot and the summary, most red wines have density between 0.9950 and 0.9975. Quality and density are only weakly correlated (R^2 = 0.03188).
Next, I’ll look at scatter plots of red wine quality against other chemical features (sulphates, citric.acid, total.sulfur.dioxide).
First, let’s see the scatter plot for quality and sulphates:
The data suffers from overplotting, so let’s add some jitter and alpha and log-transform the data.
There seems to be some correlation between quality and sulphates. Let’s dive into the linear model summary to learn more.
##
## Call:
## lm(formula = quality ~ sulphates, data = subset(rw, sulphates <=
## quantile(rw$sulphates, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.91625 -0.53267 0.07005 0.45363 2.39883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.73813 0.08061 58.78 <2e-16 ***
## sulphates 1.36990 0.11917 11.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7756 on 1595 degrees of freedom
## Multiple R-squared: 0.07651, Adjusted R-squared: 0.07593
## F-statistic: 132.1 on 1 and 1595 DF, p-value: < 2.2e-16
Based on the R^2 value, sulphates explains only 7.651% of the variance in red wine quality!
Now, let’s see the scatter plot for quality and citric.acid:
The data suffers from overplotting, so let’s add some jitter and alpha and log-transform the data.
##
## Call:
## lm(formula = quality ~ citric.acid, data = subset(rw, citric.acid <=
## quantile(rw$citric.acid, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.01809 -0.59820 0.09909 0.50922 2.59711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.37360 0.03371 159.384 <2e-16 ***
## citric.acid 0.97651 0.10144 9.627 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7847 on 1595 degrees of freedom
## Multiple R-squared: 0.05491, Adjusted R-squared: 0.05432
## F-statistic: 92.68 on 1 and 1595 DF, p-value: < 2.2e-16
Based on the plot and the summary, the horizontal strips in the plot and the low R^2 value (0.05491) indicate that quality and citric.acid are only weakly correlated.
Let’s see the scatter plot for quality and total.sulfur.dioxide:
The data suffers from overplotting, so let’s add some jitter and alpha and log-transform the data.
##
## Call:
## lm(formula = quality ~ total.sulfur.dioxide, data = subset(rw,
## total.sulfur.dioxide <= quantile(rw$total.sulfur.dioxide,
## 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8299 -0.6300 0.1964 0.3858 2.5857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8772219 0.0348085 168.845 <2e-16 ***
## total.sulfur.dioxide -0.0052610 0.0006208 -8.475 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7893 on 1595 degrees of freedom
## Multiple R-squared: 0.04309, Adjusted R-squared: 0.04249
## F-statistic: 71.82 on 1 and 1595 DF, p-value: < 2.2e-16
Based on the plot and the summary, the horizontal strips in the plot indicate that quality and total.sulfur.dioxide are only weakly correlated.
Red wine quality correlates with alcohol (R^2 = 23.14%) and volatile.acidity (R^2 = 14.74%), and slightly with sulphates (R^2 = 7.65%).
As alcohol increases, the variance in quality increases. In the plot of quality versus alcohol, there are horizontal bands where many red wines take on the same quality value at different alcohol points. Based on the R^2 value, alcohol explains only about 23 percent of the variance in quality. Other features of interest can be incorporated into the model to explain the variance in the quality.
Red wines with higher volatile.acidity tend to have lower quality while red wines with lower volatile.acidity tend to have better quality.
We found that quality is only weakly correlated with citric.acid, total.sulfur.dioxide and density.
The quality of red wines is most strongly correlated with alcohol (r = 0.476), negatively correlated with volatile.acidity (r = -0.391), and more weakly correlated with sulphates (r = 0.251).
Let’s plot the relationship between quality and alcohol using volatile.acidity as the color parameter.
I noticed that there are a few blank regions in the plot. This may be because quality is recorded as an integer. The graph shows a roughly linear relationship between quality and alcohol.
Let’s see the plot for alcohol and quality using sulphates in the color parameter.
The results from the log transformation show that the darker points are at the bottom-left while the lighter points are at the top-right.
Let’s plot the relationship of quality and volatile.acidity using sulphates in color:
From the plot we can see that quality and volatile.acidity are negatively related.
Next, let’s fit the models:
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = rw)
## m2: lm(formula = quality ~ alcohol + sulphates, data = rw)
## m3: lm(formula = quality ~ alcohol + sulphates + volatile.acidity,
## data = rw)
##
## ==============================================================
## m1 m2 m3
## --------------------------------------------------------------
## (Intercept) 1.875*** 1.375*** 2.611***
## (0.175) (0.177) (0.196)
## alcohol 0.361*** 0.346*** 0.309***
## (0.017) (0.016) (0.016)
## sulphates 0.994*** 0.679***
## (0.102) (0.101)
## volatile.acidity -1.221***
## (0.097)
## --------------------------------------------------------------
## R-squared 0.227 0.270 0.336
## adj. R-squared 0.226 0.269 0.335
## sigma 0.710 0.690 0.659
## F 468.267 294.988 268.912
## p 0.000 0.000 0.000
## Log-likelihood -1721.057 -1675.142 -1599.384
## Deviance 805.870 760.894 692.105
## AIC 3448.114 3358.284 3208.768
## BIC 3464.245 3379.793 3235.654
## N 1599 1599 1599
## ==============================================================
From the table, even the largest model (m3) explains only 33.6% of the variance in quality.
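The nested comparison in the table can be sketched in base R. The table itself looks like the output of a model-table helper, but the original code is not shown, so this is only an assumed equivalent: it refits the three formulas on the first ten rows from the str() output (`rw_head` is a hypothetical name), and its numbers will not match the full-data table.

```r
# First ten rows, copied from the str() output.
rw_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same three formulas as m1, m2 and m3 in the table, each adding one predictor.
m1 <- lm(quality ~ alcohol, data = rw_head)
m2 <- update(m1, . ~ . + sulphates)
m3 <- update(m2, . ~ . + volatile.acidity)
models <- list(m1 = m1, m2 = m2, m3 = m3)

comparison <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC)
)
round(comparison, 3)
```

Plain R-squared can never decrease when a predictor is added to a nested model, so adjusted R-squared, AIC and BIC are the fairer columns to compare; in the full-data table all three improve from m1 to m3.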
In the plot of quality against volatile.acidity colored by sulphates, the color gradient shows the pattern: lighter points cluster at the top-left and darker points at the bottom-right.
Yes, I created a linear model using alcohol, sulphates and volatile.acidity. The model is built on the variables most closely related to quality, which should make the model more robust. The limitation is that none of these variables is strongly related to quality, so the model explains only about 33% of the variance in quality.
The histogram of alcohol (%) is right-skewed but without a very long tail, which looks more similar to the distribution of quality ratings.
After adding jitter and a log transformation, alcohol appears to have a positive relationship with quality. volatile.acidity (g/dm^3) appears to have a negative relationship with quality.
In this plot, we can clearly see that the relationship between volatile.acidity (g/dm^3) and quality is negative. In addition, the color shows that lighter sulphates points (g/dm^3) sit mostly at the top-left of the plot, while darker sulphates points (g/dm^3) sit mostly at the bottom-right.
Throughout the analysis, I was first held back by my lack of background knowledge about red wine. All the chemical terms were unfamiliar to me. I solved this by searching Google for “red wine chemical compounds” and reading through some articles explaining the terms.
My second challenge was that after finishing the “Univariate” section, I had no solid evidence for what to analyze next; I could only guess what might be interesting to explore based on the shapes of the histograms. This was resolved once I started the bivariate section and used the cor() and ggpairs() functions to find out which variables have the strongest relationships with quality.
Third, I reached a dead end: even after applying the log transformation, the plots still didn’t look good, and I couldn’t find anything interesting in them. The issue was solved after I went back to the course material and applied jitter() to my plots. After adding this function, the plots looked better and I could see patterns in the data.
Besides, there might be other factors that affect the results of the analysis:
- Country/Region: where each red wine is produced.
- Storage: is the wine stored properly?
- Year of production: this might affect the quality a lot.
In addition, since none of the variables is strongly correlated with quality, the model may be biased.